Data Position and Profiling in Domain-Independent Warehouse Cleaning
Abstract
A major problem that arises when integrating different databases is the existence of duplicates. Data cleaning is the process of identifying two or more records within a database that represent the same real-world object (duplicates), so that a unique representation can be adopted for each object. Existing data cleaning techniques rely heavily on full or partial domain knowledge. This paper proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level. The paper also proposes a technique for field weighting through data profiling, which, when used with the positional algorithm, achieves domain-independent cleaning at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than existing algorithms.
Similar resources
Modeling the Data Warehouse Refreshment Process as a Workflow Application
This article is a position paper on the nature of the data warehouse refreshment which is often defined as a view maintenance problem or as a loading process. We will show that the refreshment process is more complex than the view maintenance problem, and different from the loading process. We conceptually define the refreshment process as a workflow whose activities depend on the available pro...
Eliminating Fuzzy Duplicates in Data Warehouses
The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches ...
Identification of Categorical Registration Data of Domain Names in Data Warehouse Construction Task
This work is dedicated to formation of data warehouse for processing of a large volume of registration data of domain names. Data cleaning is applied in order to increase the effectiveness of decision making support. Data cleaning is applied in warehouses for detection and deletion of errors, discrepancy in data in order to improve their quality. For this purpose, fuzzy record comparison algori...
On Data Cleaning In Building XML Data Warehouses
One of the most important aspects in building an XML data warehouse is data cleaning and integration process. This paper presents a detailed methodology for cleaning data and integrating, especially useful for general situations when different-source documents are involved. Both situations whereby the XML documents have an associated XML Schema or they are just independent XML documents are con...
A Unified Framework and Sequential Data Cleaning Approach for a Data Warehouse
The data cleaning is the process of identifying and removing the errors in the data warehouse. Data cleaning is very important in data mining process. Most of the organizations are in the need of quality data. The quality of the data needs to be improved in the data warehouse before the mining process. The framework available for data cleaning offers the fundamental services for data cleaning s...
Publication date: 2003